Communications Medicine
Springer Science and Business Media LLC
Preprints posted in the last 7 days, ranked by how well they match Communications Medicine's content profile, based on 85 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Khattab, A.; Wang, Z.; Srinivasasainagendra, V.; Tiwari, H. K.; Loos, R.; Limdi, N.; Irvin, M. R.
Background: Diabetic kidney disease (DKD) is a leading cause of kidney failure in individuals with type 2 diabetes (T2D), yet risk identification in routine clinical practice remains incomplete. A critical and often overlooked barrier is risk observability: how much of a patient's underlying risk is actually captured in their clinical record at the time of screening. Existing prediction models evaluate performance using model-specific thresholds, making it difficult to understand how additional data sources alter real-world screening behavior or which individuals benefit when models are expanded. Methods: We developed a series of five nested machine learning models evaluated at a one-year landmark following T2D diagnosis using data from the All of Us Research Program (N = 39,431; cases = 16,193). Each successive model added a distinct information layer -- intrinsic risk, laboratory snapshots, medication exposure, longitudinal care trajectories, and social determinants of health (SDOH) -- while retaining all prior features. All models were evaluated under a fixed screening policy targeting 90% specificity, so that the false positive rate remained constant as the information available to the model grew. External validation was conducted in the BioMe Biobank (N = 9,818) without retraining. Results: Discrimination improved consistently across layers, from AUROC 0.673 (M1) to 0.797 (M5). Under the fixed screening policy, sensitivity nearly doubled from 0.27 to 0.49, with a cumulative recovery of 30.4% of cases missed by the base model. Gains were driven by distinct subgroups at each transition: laboratory features identified biologically high-risk individuals; medication features captured those with high treatment intensity reflecting advanced cardiometabolic burden; longitudinal care trajectory features rescued cases with biological instability observable only through repeated measurements; and SDOH features recovered individuals with limited clinical observability, with rescue probability highest among those with the fewest recorded monitoring domains. Sparse data in the clinical record indicated low observability, not low risk. Social and genetic features each contributed most when downstream physiologic signal was limited, supporting a contextual rather than universal role for each. In BioMe, discrimination was attenuated (M4 AUROC 0.659), but the relative ordering of information layers was fully preserved, and a systematic upward shift in predicted probability distributions underscored the need for recalibration before deployment in a new setting. Conclusions: DKD risk detection in T2D is substantially improved by integrating complementary information layers under a fixed clinical screening policy, with gains arising from distinct domains that identify at-risk individuals in different clinical contexts. The layered landmark framework introduced here reveals how risk observability -- shaped by monitoring intensity, healthcare engagement, and access -- determines what a screening model can detect, and provides a foundation for context-aware EHR-based screening that accounts for data availability at the time of risk assessment.
Graphical abstract: Study design and layered DKD screening framework. The top row defines the cohort timeline, in which predictors are derived from clinical data collected between T2D diagnosis and the 1-year landmark, and incident DKD is ascertained after the landmark. The second row depicts the nested model architecture, in which five successive models sequentially incorporate intrinsic risk, laboratory snapshot features, medication exposure, longitudinal care trajectories, and social determinants of health, while retaining all features from prior layers. The third row summarizes model development in the All of Us Research Program (N = 39,431) and external validation in the BioMe Biobank (N = 9,818), where the same trained models and risk thresholds were applied without retraining. The bottom row highlights the three evaluation domains: predictive performance, fixed-policy screening, and missed-case recovery context. DKD, diabetic kidney disease; T2D, type 2 diabetes; PRS, polygenic risk scores; AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; PPV, positive predictive value; SHAP, SHapley Additive exPlanations.
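To make the fixed screening policy concrete: the threshold for every nested model is set so that specificity stays at 90%, and any benefit from added features shows up purely as sensitivity. A minimal Python sketch, with simulated labels and scores standing in for the paper's models:

```python
# Minimal sketch of a fixed-specificity screening policy; the data, score
# model, and threshold rule are illustrative assumptions, not the paper's.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 5000)                   # validation labels
s_val = y_val * 0.8 + rng.normal(0, 1, 5000)       # toy risk scores

# Fix the policy: the threshold is the 90th percentile of scores among
# true negatives, pinning specificity at roughly 90%.
threshold = np.quantile(s_val[y_val == 0], 0.90)

# Any richer model is judged at a threshold chosen the same way, so gains
# appear as recovered cases (sensitivity), not extra false alarms.
y_test = rng.integers(0, 2, 5000)
s_test = y_test * 0.8 + rng.normal(0, 1, 5000)
flag = s_test >= threshold
print(f"AUROC       {roc_auc_score(y_test, s_test):.3f}")
print(f"sensitivity {flag[y_test == 1].mean():.2f}")
print(f"specificity {(~flag)[y_test == 0].mean():.2f}")
```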
Bahig, S.; Oughton, M.; Vandesompele, J.; Brukner, I.
In dense urban settings, delays between diagnostic sampling and effective isolation can sustain transmission during peak infectiousness. We define a waiting-window transmission externality arising when infectious individuals remain mobile while awaiting results, formalized as E = N·P·TR·D, where N is daily testing volume, P test positivity, TR transmission during the waiting period, and D turnaround time. Using Monte Carlo simulation and a susceptible-infectious-recovered (SIR) framework, we quantify excess infections per 1,000 tests/day under multiple diagnostic workflows. A surge scenario incorporates positive coupling between TR and D (ρ = 0.45), reflecting co-occurrence of laboratory saturation and elevated contacts during system stress. Under centralized 48-hour workflows, excess infections reach ~80 at P = 10% and ~401 at P = 50%, increasing to ~628 under surge conditions. In contrast, near-patient rapid testing and home sampling reduce this to ~5 and ~25-26, respectively. Workflows that eliminate the waiting window -- either through immediate isolation at sampling or through home-based PCR that returns results at the point of collection -- effectively collapse the transmission term. These findings identify diagnostic delay as a modifiable driver of epidemic dynamics. Operational redesign of testing workflows, including decentralized sampling and home-based molecular diagnostics, offers a scalable pathway to improve epidemic controllability and reduce inequities in dense urban environments.
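Since the abstract gives the externality formula explicitly, it can be evaluated directly. A hedged Monte Carlo sketch of E = N·P·TR·D with the surge coupling ρ = 0.45; all parameter values below are illustrative assumptions, not the paper's calibration:

```python
# Sketch of the waiting-window externality E = N * P * TR * D with a surge
# scenario coupling TR and D (rho = 0.45) via correlated Gaussians.
# Distributions and scales are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(42)
N, P = 1000, 0.10                 # tests/day, test positivity
n_sim, rho = 100_000, 0.45

# Correlated standard normals feed lognormal TR (per-day transmission while
# mobile) and D (turnaround in days, centered on a 48-hour workflow).
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=n_sim)
TR = 0.4 * np.exp(0.3 * z[:, 0])  # assumed ~0.4 secondary cases/day
D = 2.0 * np.exp(0.2 * z[:, 1])   # assumed ~2-day turnaround

E = N * P * TR * D                # excess infections attributable to waiting
print(f"mean E = {E.mean():.0f} excess infections per 1,000 tests/day")
```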
Schmidt, C.; Samartsidis, P.; Seaman, S.; Emmanouil, B.; Foster, G.; Reid, L.; Smith, S.; De Angelis, D.
To minimise health disparities, equitable access to medical treatment is paramount. In a pioneering intervention, National Health Service England's Hepatitis C virus (HCV) programme has implemented country-wide peer support to boost treatment access. Peer support workers (peers) are individuals with relevant lived experience, who promote testing and treatment in marginalised populations underserved by traditional health services. We evaluated the English peers intervention, exploiting its staggered rollout and rich surveillance data between June 2016 and May 2021. Peers increased HCV cases identified by 13·9% (95% credible interval (95% CrI) [5·3, 21·7]), sustained viral responses by 8·0% (95% CrI [-4·4, 18·6]), and drug services referrals by 8·8% (95% CrI [-12·5, 22·6]). The intervention's effectiveness was magnified during the first COVID-19 lockdown, and individuals supported by peers typically belonged to populations with poor treatment access. Our findings indicate that peers can boost equity in treatment access on a national scale.
Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.
Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, using OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
Wang, V.; Deng, S.; Aguilar, R.
Background: The retired antigen hypothesis, introduced by Tuohy and colleagues, proposes that tissue-specific proteins expressed conditionally during early life or reproductive stages, then silenced in normal aging tissue, represent safe and effective cancer vaccine targets when re-expressed in tumors. To date, discovery of retired antigens has relied entirely on hypothesis-driven wet lab work, limiting throughput. Methods: Here we present RADAR (Retired Antigen Discovery and Ranking), a multi-omics computational pipeline implemented on a standard server that systematically identifies retired antigen candidates. RADAR comprises four core discovery layers integrating: 1) Genotype-Tissue Expression (GTEx) portal normal tissue expression, 2) TCGA tumor re-expression, 3) DNA methylation, and 4) miRNA regulatory networks, each applied sequentially to identify genes exhibiting the epigenetic and post-transcriptional hallmarks of tissue-specific retirement followed by tumor re-activation. Candidate characterization is further supported by three automated modules: 1) protein-level safety screening via the Human Protein Atlas, 2) molecular subtype enrichment analysis, and 3) cross-cancer confirmation, which execute automatically when the relevant data are available for the selected cancer type. Results: The pipeline independently validated known targets including alpha-lactalbumin (LALBA, the basis of the Tuohy Phase 1 triple-negative breast cancer vaccine trial) and anti-Müllerian hormone (AMH), consistent with Tuohy's ovarian cancer vaccine program targeting AMHR2, and rediscovered multiple known cancer-testis antigens (MAGEA1, MAGEC1, SSX1) as positive controls. Among 4,664 initial candidates derived from GTEx, the pipeline identified 20 high-confidence retired antigen candidates passing all filters. DCAF4L2, COX7B2, TEX19, and CT83 emerge as the highest-priority novel candidates for experimental validation, demonstrating zero expression in critical somatic organs, strong epigenetic silencing, and significant re-expression across multiple cancer types. Conclusion: RADAR provides the first systematic computational framework for retired antigen discovery, offering a reproducible and scalable approach to expanding the cancer immunoprevention pipeline beyond individually characterized targets. The pipeline is fully reproducible, requires no specialized hardware, and is immediately extensible to additional TCGA cancer types.
Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.
The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.
Dasgupta, N.; Sibley, A. L.; Gildner, P.; Gora Combs, K.; Post, L. A.; Tobias, S.; Kral, A. H.; Pacula, R. L.
Drug overdose deaths in the United States reached record levels during the fentanyl era before recently declining. A plausible hypothesis is that a sudden drop in fentanyl purity beginning in 2023 caused the downturn in overdose mortality. We evaluated this hypothesis by replicating a published analysis with regional overdose data, using models that account for time trends and autocorrelation, and negative control indicators to test for spurious correlation. When fentanyl purity was rising, the national purity series did not track overdose increases in most regions and showed only a modest association in the West. When both purity and mortality later declined, the observed associations were also seen with unrelated macroeconomic indicators that shared the same time pattern. National fentanyl purity alone does not provide a sufficient explanation for recent overdose declines.
Rodriguez, A. M.; The Pooled Resource Open-Access ALS Clinical Trials Consortium,
Standard analysis of amyotrophic lateral sclerosis (ALS) clinical trials evaluates therapeutic efficacy by comparing linear slopes of total ALS Functional Rating Scale (ALSFRS) scores between treatment arms. This approach compresses multidomain ordinal data into a single scalar trajectory, discarding distributional structure. When subgroup-level trends differ in timing or direction, such aggregation can attenuate or eliminate them, a phenomenon known as Simpson's paradox. Here we apply Shannon entropy, computed from item-level score distributions within each ALSFRS functional domain following the framework established in [8], to the PRO-ACT database, stratified by treatment arm (Active: n = 4,581; Placebo: n = 2,931; 19 monthly time points). The entropy trajectories of drug-treated and placebo populations diverge visibly and systematically across all four functional domains (Bulbar, Fine Motor, Gross Motor, Respiratory). In the Fine Motor domain, the placebo population reaches peak entropy at month 8 and reverses, while the active population does not peak until month 13, a five-month delay in the population's transit toward functional loss. This divergence is model-independent: it is present in the raw Shannon entropy trajectories before any dynamical model is applied. A permutation test shuffling patient-level arm labels (n = 1,000 permutations) confirms that the total integrated absolute divergence across all four domains exceeds the null distribution at p < 0.001 (observed: 4.48; null: 2.03 ± 0.33; 7.5 standard deviations above the null mean), with Fine Motor (p = 0.001) and Respiratory (p < 0.001) individually significant. The quantity that differs between arms, the shape and timing of the population's distributional evolution, does not exist as a measurable quantity in the total-score linear-slope framework used to evaluate these trials. Whether this signal reflects genuine treatment effects, compositional artifacts from pooling heterogeneous trials, or both cannot be determined from the anonymized public database alone. What can be determined is that the standard ALS clinical trial endpoint makes an implicit assumption, that the distributional information it discards is uninformative, and the present results demonstrate empirically that this assumption is false.
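The core quantities here, per-arm Shannon entropy of item-level score distributions at each month plus a label-permutation test on integrated absolute divergence, can be reproduced in outline. A sketch with simulated stand-in trajectories (not PRO-ACT data):

```python
# Sketch of the entropy-trajectory comparison and permutation test.
# Trajectories are simulated placeholders; scipy's entropy() normalizes
# the raw per-month score counts into a distribution.
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(1)
months, levels = 19, 5                   # monthly time points; ordinal 0-4
n_active, n_placebo = 4581, 2931

def arm_entropy(scores):
    """Shannon entropy of the pooled score distribution at each month."""
    return np.array([entropy(np.bincount(scores[:, t], minlength=levels))
                     for t in range(scores.shape[1])])

active = rng.integers(0, levels, (n_active, months))
placebo = rng.integers(0, levels, (n_placebo, months))

def divergence(a, b):
    """Integrated absolute divergence between entropy trajectories."""
    return np.abs(arm_entropy(a) - arm_entropy(b)).sum()

observed = divergence(active, placebo)
pooled = np.vstack([active, placebo])
null = []
for _ in range(1000):                    # shuffle patient-level arm labels
    idx = rng.permutation(len(pooled))
    null.append(divergence(pooled[idx[:n_active]], pooled[idx[n_active:]]))
p = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print(f"observed divergence {observed:.2f}, permutation p = {p:.3f}")
```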
Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.
Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease. We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
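A small sketch of the calibration checks this abstract emphasizes (Brier score, a reliability curve, and a Platt-style post-hoc correction), using a simulated overconfident classifier rather than any of the evaluated models:

```python
# Calibration-check sketch with simulated overconfident predictions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 5000)
logit = 3.0 * (y - 0.5) + rng.normal(0, 1.5, 5000)  # true signal + noise
p_raw = 1 / (1 + np.exp(-2.0 * logit))              # overconfident probs

print(f"Brier (raw): {brier_score_loss(y, p_raw):.4f}")
frac_pos, mean_pred = calibration_curve(y, p_raw, n_bins=10)
for m, f in zip(mean_pred, frac_pos):               # reliability curve bins
    print(f"  predicted {m:.2f} -> observed {f:.2f}")

# Platt-style recalibration: a 1-D logistic refit on the scores (in practice
# fit on held-out data; done in-sample here for brevity).
platt = LogisticRegression().fit(logit.reshape(-1, 1), y)
p_cal = platt.predict_proba(logit.reshape(-1, 1))[:, 1]
print(f"Brier (recalibrated): {brier_score_loss(y, p_cal):.4f}")
```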
Liu, Y.; Chen, Z.; Suman, P.; Cho, H.; Prosperi, M.; Wu, Y.
This study developed a large language model (LLM)-based solution to identify people at HIV risk using electronic health records. We transformed structured EHR data, including demographics, diagnoses, and medications, into narrative descriptions ordered by visit date and applied GatorTron, a widely used clinical LLM trained on 82 billion words of de-identified clinical text. We compared GatorTron with traditional machine learning models, including LASSO and XGBoost. We identified a cohort with 54,265 individuals, where only 3,342 (6%) had new HIV diagnoses. Our LLM solution, based on GatorTron, achieved excellent performance, reaching an F1 score of 53.5% and an AUC of 0.88, comparable to traditional machine learning approaches. Subgroup analysis showed that, across age, sex, and race/ethnicity groups, both LLM and traditional models achieved AUCs above 0.82. Interpretability analyses showed broadly consistent patterns across LLM models and traditional machine learning models.
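The serialization step, turning structured EHR rows into a visit-ordered narrative the clinical LLM can read, can be sketched as below; the field names and sentence template are illustrative assumptions, not the paper's exact format:

```python
# Sketch of structured-EHR-to-narrative serialization for an LLM input.
# Fields, diagnoses, and the template are hypothetical examples.
from datetime import date

demographics = {"age": 29, "sex": "male"}
visits = [
    {"date": date(2021, 3, 2), "diagnoses": ["syphilis"], "meds": []},
    {"date": date(2021, 9, 15), "diagnoses": ["anxiety disorder"],
     "meds": ["emtricitabine-tenofovir"]},
]

def to_narrative(demo, visits):
    lines = [f"Patient is a {demo['age']}-year-old {demo['sex']}."]
    for v in sorted(visits, key=lambda v: v["date"]):  # order by visit date
        parts = []
        if v["diagnoses"]:
            parts.append("diagnosed with " + ", ".join(v["diagnoses"]))
        if v["meds"]:
            parts.append("prescribed " + ", ".join(v["meds"]))
        lines.append(f"On {v['date']:%B %d, %Y}, the patient was "
                     + " and ".join(parts) + ".")
    return " ".join(lines)

print(to_narrative(demographics, visits))  # text the clinical LLM consumes
```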
Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.
Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models typically assume a uniform, fixed sampling frequency, limiting their generalizability, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge records, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.
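A hedged sketch of what the resolution-transfer evaluation looks like operationally: a model trained on high-resolution vitals is applied, without retraining, to the same signal resampled at coarser monitoring intervals. The data, intervals, and stand-in model below are illustrative assumptions:

```python
# Resolution-transfer sketch: resample a 1-minute vitals series to coarser
# intervals before inference. The trained model is left as a placeholder.
import numpy as np

def resample(signal_1min, interval_min):
    """Keep one reading per monitoring interval (coarser resolution)."""
    return signal_1min[::interval_min]

rng = np.random.default_rng(3)
minutes = np.arange(360)                           # six hours of heart rate
hr_1min = 80 + 5 * np.sin(minutes / 30) + rng.normal(0, 1, minutes.size)

for interval in (10, 30, 60):   # intervals like those examined above
    coarse = resample(hr_1min, interval)
    # model.predict(coarse) would go here; we show only what the model sees.
    print(f"{interval}-min sampling -> {coarse.size} readings over 6 h")
```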
Maldonado, A.; Heberer, K.; Lynch, J.; Cogill, S. B.; Nallamshetty, S.; Chen, Y.; Shih, M.-C.; Bress, A. P.; Lee, J.
Importance: Semaglutide, a glucagon-like peptide-1 receptor agonist (GLP-1RA), is a highly effective medication to treat type 2 diabetes and obesity. However, concerns about potential suicidality persist, creating clinical uncertainty about its neuropsychiatric safety. Objective: To assess risks of suicidality after initiating semaglutide compared to initiating SGLT2i, and by duration of continuous semaglutide treatment. Design: Active-comparator, new-user target trial emulation to estimate inverse probability-weighted marginal cause-specific hazard ratios (HRs). For duration-of-treatment analyses, we used clone-censor-weight methods to estimate exposure-adjusted effects. Setting: Veterans Health Administration. Participants: U.S. Veterans with type 2 diabetes receiving care from March 1, 2018 to September 1, 2025. Exposure: Initiation of semaglutide vs SGLT2i; duration of semaglutide use (≤6, 7-12, >12 months). Outcomes: Incident suicidal ideation; suicide attempt or death; and a composite outcome. Results: A total of 102,361 Veterans met inclusion criteria, including 11,478 new initiators of semaglutide and 90,883 new initiators of an SGLT2i. After overlap weighting, baseline characteristics were well balanced between treatment groups (mean [SD] age, 60.1 [11.7] years; BMI, 37.8 [6.8] kg/m2; hemoglobin A1c, 7.0% [1.4]; 85.5% male; 61.9% non-Hispanic White). During a median follow-up of 2.2 years, 9,077 incident suicidal ideation events and 696 suicide attempts or deaths occurred. The incidence rate of suicidal ideation was 56.3 and 37.7 per 1,000 person-years among semaglutide initiators and SGLT2i initiators, respectively (hazard ratio [HR], 0.99; 95% CI, 0.93-1.06; P = 0.86). For suicide attempts or deaths, the incidence rates were 4.30 and 2.64 per 1,000 person-years, respectively (HR, 1.05; 95% CI, 0.84-1.31; P = 0.86). In adherence-adjusted analyses, sustained semaglutide treatment for more than 12 months, compared with 6 or fewer months, was associated with a 74% lower risk of suicide attempts or deaths (HR, 0.27; 95% CI, 0.14-0.54; P < 0.001). Conclusion: Among U.S. Veterans with type 2 diabetes, initiators of semaglutide were not observed to have an increased risk of suicidality compared with initiators of SGLT2i. Those with longer semaglutide treatment (beyond 12 months) had a decreased risk of suicide attempt or death, suggesting longer-term treatment is safe and may protect against those outcomes.
Lin, T.; Li, Y.; Huang, Z.; Gui, T. T.; Wang, W.; Guo, Y.
Target trial emulation (TTE) offers a principled way to estimate treatment effects using real-world observational data, but analyses of time-varying treatment strategies remain vulnerable to immortal time bias. The clone-censor-weight (CCW) approach is increasingly used to address this problem, yet key aspects of its causal interpretation and implementation remain unclear. In this work, we emulate a target trial using electronic health records (EHRs) to compare completion of a 3-dose 9-valent human papillomavirus (HPV) vaccination series within 12 months versus remaining partially vaccinated among vaccine initiators. We link CCW to the classic potential outcome framework in causal inference, evaluate the role of different weighting mechanisms, and account for within-subject correlation induced by cloning using cluster-robust variance estimation. Our study provides practical guidance for applying CCW in real-world comparative effectiveness studies to address immortal time bias and supports more rigorous and interpretable treatment effect estimation in TTE.
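A minimal sketch of the clone-censor-weight construction for these two strategies, assuming a 12-month grace period; the toy subjects and column names are illustrative, not the paper's implementation:

```python
# Clone-censor-weight sketch: every subject is cloned into both strategy
# arms at time zero (eliminating immortal time), then each clone is
# artificially censored when its data deviate from the assigned strategy.
import pandas as pd

subjects = [
    {"id": 1, "months_to_dose3": 8.0},   # completed the series at month 8
    {"id": 2, "months_to_dose3": None},  # never completed
    {"id": 3, "months_to_dose3": 14.0},  # completed after the 12-month window
]

rows = []
for s in subjects:
    d3 = s["months_to_dose3"]
    # Clone A, "complete 3 doses within 12 months": censored at month 12
    # if no third dose has been received by then.
    dev_a = 12.0 if (d3 is None or d3 > 12.0) else None
    rows.append({**s, "arm": "complete_by_12m", "censor_time": dev_a})
    # Clone B, "remain partially vaccinated": censored when (if ever)
    # the subject actually completes the series.
    rows.append({**s, "arm": "stay_partial", "censor_time": d3})

clones = pd.DataFrame(rows)
print(clones[["id", "arm", "censor_time"]])
# Remaining steps (not shown): fit inverse-probability-of-censoring weights
# so uncensored clones re-represent those censored for deviation, and use
# cluster-robust variance by subject id, since each person contributes
# correlated clones to both arms, the issues this paper examines.
```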
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment. What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references. How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously.
Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions. The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available. One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
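The BL-TTA step described above, averaging predictions over N samples drawn from inside each latent box, reduces to a few lines. A hedged PyTorch sketch with stand-in encoder and head (the real system sits on a frozen MedSAM2 backbone; shapes and modules here are assumptions):

```python
# BL-TTA sketch: each image maps to a latent box (center c, radius r);
# predictions are averaged over points sampled uniformly inside the box.
import torch

def bl_tta_predict(encoder, head, image, n_samples=8):
    """Average multi-label probabilities over latent samples from the box."""
    c, r = encoder(image)                     # each: (batch, latent_dim)
    probs = []
    for _ in range(n_samples):
        u = torch.rand_like(c) * 2 - 1        # uniform in [-1, 1]
        z = c + u * r                         # point inside the hyperrectangle
        probs.append(torch.sigmoid(head(z)))  # per-finding probabilities
    return torch.stack(probs).mean(dim=0)     # exact only as n_samples grows

# Toy stand-ins for the frozen-backbone encoder and classification head.
latent_dim, n_labels = 16, 14
toy_encoder = lambda x: (x @ torch.randn(32, latent_dim),
                         torch.rand(x.shape[0], latent_dim) * 0.1)
toy_head = torch.nn.Linear(latent_dim, n_labels)
print(bl_tta_predict(toy_encoder, toy_head, torch.randn(2, 32)).shape)
```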
Murugadoss, K.; Venkatakrishnan, A. J.; Soundararajan, V.
Metabolic dysfunction is increasingly recognized as a risk factor for poor outcomes in breast cancer, but whether incretin-based therapies confer survival benefit beyond weight loss remains unresolved. Using a federated electronic health record platform spanning nearly 29 million patients, we evaluated breast cancer survival after semaglutide and tirzepatide initiation in routine care. In 1:1 propensity-matched pooled-comparator analyses, semaglutide was associated with improved overall survival versus metformin, sodium-glucose cotransporter 2 (SGLT2) inhibitor, and dipeptidyl peptidase 4 (DPP4) inhibitor users, with 54 deaths among 2,433 semaglutide users (2.2%) versus 395 deaths among 2,433 comparators (16.2%) over 24 months (log-rank P < 0.001). Tirzepatide showed a favorable survival association relative to pooled anti-diabetic comparators that did not meet statistical significance (P = 0.24), with 3 deaths among 220 users (1.4%) versus 64 deaths among 220 comparators (29.1%). In a head-to-head propensity-score-matched comparison, overall survival did not differ significantly between semaglutide- and tirzepatide-treated patients with pre-existing breast cancer (2,117 per arm; P = 0.12). In semaglutide-treated patients alive and observable at the 1-year landmark, higher maximum dose achieved was significantly associated with lower post-landmark mortality (P = 0.034), with an event rate of approximately 1.0% in the high-dose group (≥1.7 mg) versus approximately 4.5% in the low-dose group (0.25-1.0 mg). Despite a linear dose-weight-loss relationship for semaglutide, weight-loss strata did not separate survival outcomes (global P = 0.22). In tirzepatide-treated patients alive and observable at the same landmark, neither maximum dose achieved nor weight-loss strata separated post-landmark survival (P = 0.98 and P = 0.50, respectively). Structured EHR and AI-based clinical note analyses further showed significantly lower frequency of documented metastatic disease in semaglutide-treated patients relative to pooled anti-diabetic comparators, including any metastasis (7.0% versus 15.0%, rate ratio 0.5, P < 0.001), bone metastasis (1.0% versus 5.2%, rate ratio 0.2, P < 0.001), and liver, lung, or brain metastases (all P < 0.001). LLM-derived cause-of-death extraction further showed a 60% lower relative proportion of cancer-associated deaths in semaglutide-treated patients (19% of ascertainable deaths) than in matched pooled anti-diabetic comparators (47% of ascertainable deaths), with comparator deaths more often attributed to cancer progression involving metastatic breast cancer, leptomeningeal carcinomatosis, and cancer-driven organ failure. Overall, this study demonstrates that semaglutide use in patients with pre-existing breast cancer is associated with a dose-correlated but weight-loss-independent improvement in overall survival. These findings motivate prospective trials of GLP-1 receptor agonists in breast cancer across various stages and treatment settings.
Palmer, M.; Hashiguchi, T.; Arman, A. C.; Shirakata, Y.; Buss, N. E.; Lalezari, J. P.
Background: Chemokine receptor type 5 (CCR5) is expressed on hepatic stellate cells (HSCs), which, together with fibroblasts, are major producers of extracellular matrix during liver fibrosis. Leronlimab is a humanized IgG4κ monoclonal antibody that binds to CCR5. The objective of the present study was to evaluate the antifibrotic effects of leronlimab in three independent preclinical studies using two mouse models of liver fibrosis. Methods: In STAM (Stelic Animal Model) model 1, leronlimab was administered at doses of 5 or 10 mg/kg/week for 3 weeks. STAM model 2 was conducted as a confirmatory study to validate the antifibrotic effect observed with the 10 mg/kg/week dose in STAM model 1. In a third study, a carbon tetrachloride (CCl4)-induced liver fibrosis mouse model was used to evaluate leronlimab administered at 10 mg/kg/week for 3 weeks. An isotype-matched control antibody was included in all studies for comparison. Evaluations included liver enzymes and histological assessment of liver fibrosis. Results: In STAM model 1, leronlimab at 10 mg/kg/week significantly reduced fibrosis area compared with the isotype control (p = 0.0005). These findings were confirmed in STAM model 2 (p < 0.0001). Consistent antifibrotic effects were also observed in the CCl4-induced liver fibrosis model (p = 0.0006). Conclusions: Collectively, these preclinical results demonstrate that CCR5 blockade by leronlimab is associated with a significant reduction of established liver fibrosis in multiple mouse models and support further evaluation of leronlimab as a potential therapeutic option, either as monotherapy or in combination regimens, for chronic liver diseases with fibrosis.
Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.
The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.
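Gwet's AC1, the agreement statistic used for both the expert panels and the model, has a closed form for two raters: AC1 = (pa - pe)/(1 - pe), with chance agreement pe = (1/(q-1)) Σk πk(1 - πk), where πk is the mean marginal proportion for category k across the two raters. A sketch with toy ratings:

```python
# Gwet's AC1 for two raters over nominal categories; ratings are toy data.
import numpy as np

def gwet_ac1(r1, r2):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    q = len(cats)
    pa = np.mean(r1 == r2)                        # observed agreement
    # Mean marginal proportion per category across the two raters.
    pi = np.array([(np.mean(r1 == k) + np.mean(r2 == k)) / 2 for k in cats])
    pe = np.sum(pi * (1 - pi)) / (q - 1)          # chance agreement
    return (pa - pe) / (1 - pe)

expert = ["normal", "abnormal", "abnormal", "normal", "abnormal", "normal"]
model  = ["normal", "abnormal", "normal",   "normal", "abnormal", "abnormal"]
print(f"AC1 = {gwet_ac1(expert, model):.2f}")
```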
Zhang, R.
Aims The oral glucose tolerance test (OGTT) is effective for detecting post-load dysglycemia, but it is burdensome and therefore not routinely used. Continuous glucose monitoring (CGM) offers a convenient way to capture real-world glucose patterns, yet it remains unclear whether CGM-derived metrics reflect OGTT-defined dysglycemia. We therefore aimed to evaluate CGM-derived and clinical metrics for predicting OGTT 2-hour glucose, classifying OGTT-defined dysglycemia, and assessing day-to-day repeatability. Methods We analyzed a cohort with paired free-living CGM and OGTT. Multiple CGM-derived metrics and clinical measures were compared for prediction of OGTT 2-hour glucose, classification of OGTT-defined dysglycemia, and day-to-day stability. Predictive performance was assessed primarily by leave-one-out (LOO) R^2, and day-to-day repeatability by intraclass correlation coefficients (ICC). Results The glycemic persistence index (GPI), a metric integrating the magnitude and duration of glycemic elevation, was the strongest single predictor of OGTT 2-hour glucose (LOO R^2 = 0.439). GPI also showed strong day-to-day repeatability (ICC = 0.665) and ranked first on a combined prediction-stability score. For classification of OGTT-defined dysglycemia, HbA1c had a slightly higher AUC than GPI, but GPI plus HbA1c performed best overall, indicating complementary information. Conclusions GPI was a strong predictor of OGTT 2-hour glucose and showed a favorable balance between predictive performance and day-to-day stability, supporting its potential utility as a CGM-derived marker of dysglycemia.
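The abstract defines GPI only as a metric integrating the magnitude and duration of glycemic elevation, so any implementation is a guess. A sketch of one natural reading, area above a glucose threshold, with the threshold and units as explicit assumptions:

```python
# Hedged sketch of a glycemic persistence index (magnitude x duration);
# the paper's exact GPI definition may differ. Threshold of 140 mg/dL and
# 5-minute CGM readings are assumptions for illustration.
import numpy as np

def gpi(glucose_mgdl, minutes_per_reading=5, threshold=140.0):
    """Integrate excursion above threshold: (mg/dL above threshold) x hours."""
    g = np.asarray(glucose_mgdl, dtype=float)
    excess = np.clip(g - threshold, 0, None)          # magnitude of elevation
    return excess.sum() * minutes_per_reading / 60.0  # duration weighting

# Two toy CGM days: a brief spike versus a sustained moderate elevation.
spike = [100] * 100 + [220] * 6 + [100] * 100      # ~30 min at 220 mg/dL
sustained = [100] * 50 + [165] * 106 + [100] * 50  # ~9 h at 165 mg/dL
print(f"GPI spike={gpi(spike):.1f}, sustained={gpi(sustained):.1f}")
# The sustained day scores higher: persistence, not peak height, dominates.
```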
Adebamowo, C.; Adebamowo, S. N. N.; Gbolahan, T.; Ikwueme, O.; Famooto, A.; Owoade, Y.; ACCME Research Group as part of H3Africa Consortium,
Persistent detection of high-risk human papillomavirus (HPV) is required for cervical carcinogenesis, yet the metabolic phenotype associated with distinct HPV transition states remains incompletely defined. We analyzed vaginal metabolomics data from 71 HIV-negative, non-smoking, premenopausal women without other sexually transmitted infections, grouped by three-visit HPV trajectories: persistent negative (NNN, n=20), late incident positivity (NNP, n=9), conversion with persistence (NPP, n=13), clearance after prior positivity (PPN, n=16), and persistent positive (PPP, n=13). After detection-based filtering, 186 putative and 64 quantitatively estimated metabolites were retained for integrated univariate, multivariate, network, pathway, and machine learning analyses. Global class separation was weak by PERMANOVA and by five-class classification, indicating that the vaginal metabolome does not reorganize broadly across all HPV states. In contrast, trajectory-specific signals were reproducible. The strongest pairwise contrast was NNP versus PPP (best cross-validated ROC AUC 0.778; permutation p=0.039). Glycolic acid was the dominant single metabolite, particularly for NNP versus PPP (Mann-Whitney p = 6.96×10^-4, FDR=0.0446, AUROC=0.902; detection 88.9% versus 15.4%; combined abundance+detection FDR=0.0010). Persistent positivity was characterized by a focused uracil-high, methyl-donor/redox-low signature, including lower glycolic acid, S-adenosylmethionine, NAD+, and betaine, together with higher uracil. Ratio mining further sharpened discrimination, with uracil/S-adenosylmethionine and uracil/creatinine among the best PPP classifiers, and glucose 1-phosphate/isovaleric acid-valeric acid strongly separating NNP from NPP. These data support a model in which HPV trajectory is encoded by targeted metabolic states rather than a diffuse HPV-positive versus HPV-negative metabolomic shift.
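The ratio-mining step, scoring every pairwise metabolite ratio by how well it separates two trajectory groups, can be sketched directly; the data, group sizes, and metabolite set below are simulated placeholders, not the study's measurements:

```python
# Ratio-mining sketch: rank pairwise metabolite ratios by AUROC between
# two trajectory groups. Abundances and labels are simulated.
import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
metabolites = ["uracil", "SAM", "creatinine", "betaine"]
X = rng.lognormal(0, 0.5, (29, len(metabolites)))  # toy abundances
X[:13, 0] *= 2.0                                   # group 1: uracil-high
y = np.array([1] * 13 + [0] * 16)                  # two trajectory groups

scores = {}
for i, j in combinations(range(len(metabolites)), 2):
    ratio = X[:, i] / X[:, j]
    auc = roc_auc_score(y, ratio)
    scores[f"{metabolites[i]}/{metabolites[j]}"] = max(auc, 1 - auc)
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUROC {auc:.2f}")
```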
Rani, A.; Mishra, S.
Accurate histopathological differentiation between High-Grade Serous Carcinoma (HGSC) and Low-Grade Serous Carcinoma (LGSC) remains a critical yet challenging aspect of ovarian cancer diagnosis due to their similar morphology and different clinical outcomes. This study presents a deep learning framework that uses custom attention mechanisms, including the Convolutional Block Attention Module (CBAM), Squeeze-and-Excitation (SE) blocks, and a Differential Attention module within five CNN architectures for automated binary classification of ovarian cancer subtypes from H&E WSI patches. Although individual models achieved higher accuracy, the ensemble stacking framework with a shallow MLP meta-learner delivered the best overall performance, with a ROC-AUC of 0.9211, an accuracy of 0.85, and F1-scores of 0.84 and 0.85 across both subtypes. These findings demonstrate that attention-guided feature recalibration combined with ensemble stacking provides robust and clinically interpretable discrimination of ovarian carcinoma subtypes.